WIP: FlashAttention for WebGPU EP #22919

Open

wants to merge 3 commits into main
Conversation


@sushraja-msft commented Nov 21, 2024

WIP: Implementation of FlashAttention that works for MHA

  • Currently works only on machines where the subgroup size equals the tile size (Intel).
  • Works only when the new sequence length is 1.

The other scenarios require more debugging. The algorithm also needs optimization for the sequence-length-1 case, because workgroups are left unused in the way ComputeDotProduct is invoked; see the sketch below for context on the computation involved.
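For context, here is a minimal CPU-side sketch of the online-softmax accumulation that FlashAttention performs per K/V tile, specialized to a single query row (the new-sequence-length-1 decode case this PR targets). This is illustrative only, not the WGSL shader in this PR; the function and variable names (`FlashAttendOneRow`, `q`, `K`, `V`) are hypothetical, and the inner dot product stands in for the role `ComputeDotProduct` plays in the shader.

```cpp
// Minimal sketch of FlashAttention's online-softmax accumulation for one
// query row attending over total_len keys/values, processed in tiles.
// Illustrative only; not the shader code in this PR.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<float> FlashAttendOneRow(const std::vector<float>& q,   // d
                                     const std::vector<float>& K,   // total_len x d, row-major
                                     const std::vector<float>& V,   // total_len x d, row-major
                                     size_t d, size_t total_len, size_t tile_size) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(d));
  float m = -INFINITY;              // running max of scores seen so far
  float l = 0.0f;                   // running sum of exp(score - m)
  std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

  for (size_t start = 0; start < total_len; start += tile_size) {
    size_t end = std::min(start + tile_size, total_len);
    for (size_t j = start; j < end; ++j) {
      // Dot product q . K[j] (the role ComputeDotProduct plays in the shader).
      float s = 0.0f;
      for (size_t c = 0; c < d; ++c) s += q[c] * K[j * d + c];
      s *= scale;

      // Online softmax update: rescale the previous accumulator if the max grew,
      // so softmax never needs a second pass over the scores.
      float m_new = std::max(m, s);
      float correction = std::exp(m - m_new);  // exp(-inf) == 0 on first step
      float p = std::exp(s - m_new);
      l = l * correction + p;
      for (size_t c = 0; c < d; ++c)
        acc[c] = acc[c] * correction + p * V[j * d + c];
      m = m_new;
    }
  }
  for (size_t c = 0; c < d; ++c) acc[c] /= l;  // final normalization
  return acc;
}

int main() {
  const size_t d = 4, total_len = 8, tile = 4;
  std::vector<float> q(d, 0.5f), K(total_len * d, 0.25f), V(total_len * d, 1.0f);
  auto out = FlashAttendOneRow(q, K, V, d, total_len, tile);
  for (float v : out) std::printf("%f ", v);  // uniform keys -> output equals the V rows
  std::printf("\n");
  return 0;
}
```

With only one query row, the outer loop over query tiles collapses to a single iteration, which is why a naive dispatch leaves most workgroups idle, matching the under-utilization described above.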

@sushraja-msft sushraja-msft changed the title User/sushraja/fa attempt2 WIP: FlashAttention for WebGPU EP Nov 21, 2024